About pipelines

A data pipeline is a structured workflow that moves and transforms data from various data sources into your central warehouse or into another data source. A pipeline can be compared to a conveyor belt: it picks up data, performs specific actions on it, and then deposits it at its designated location.

Data pipelines perform data integration: they move and transform data between a source system and a target repository to create a single, complete picture of your dataset. You can then build data visualizations and dashboards to derive and share actionable insights from your data.

Creating a data pipeline

DataGOL offers three types of pipelines to cater to different data integration needs:

  • Standard pipeline performs an exact replication of a data source into the warehouse. This is the most straightforward way to move data from a single data source into your warehouse. If you do not need every column, you can choose which columns from the source table to include.

  • Custom pipelines offer the flexibility to define your own data integration logic through queries. Imagine a scenario where a company's data resides across multiple databases. To gain a unified view, you might need to combine data from these different sources. With a custom pipeline, you write the query that specifies how to extract and transform data from your source systems, which lets you perform operations such as JOINs to combine data from two or more databases (see the example query after this list).
    You define the query that acts as the data source, and then, just like with standard pipelines, you specify the target, either the data warehouse or another data source such as an RDBMS, where the processed data will be written. This approach provides a powerful way to handle complex data integration requirements involving multiple, disparate data sources.

  • Dedup pipeline focuses on removing duplicate records from your existing warehouse data.
    During data integration, especially through custom pipelines that combine data from multiple sources, duplicate records can arise. To address this, a deduplication process or pipeline step can be employed.
    The primary purpose of Dedup is to identify and remove duplicate entries, ensuring that only unique, distinct values are persisted in the target data warehouse (see the sketch after this list). This results in cleaner and more reliable data for analysis and reporting.
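As an illustration of the custom pipeline idea, the hypothetical query below joins customer records from one source database with order records from another to produce a single result set that the pipeline writes to the target. The database, table, and column names are illustrative assumptions, not part of DataGOL.

```sql
-- Illustrative only: schema and names are hypothetical.
-- A custom pipeline query that combines data from two source
-- databases; the pipeline writes the result to the warehouse
-- or another destination such as an RDBMS.
SELECT
    c.customer_id,
    c.customer_name,
    o.order_id,
    o.order_total
FROM crm_db.customers AS c
JOIN sales_db.orders AS o
    ON o.customer_id = c.customer_id;
```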
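The deduplication step can be sketched the same way. The query below is a minimal, hypothetical example of the general technique: it keeps one row per logical key (here, the most recent row per order_id) and discards the rest, so only distinct records are persisted. It is not the exact mechanism DataGOL uses internally.

```sql
-- Illustrative only: keeps the latest row for each order_id
-- and drops older duplicates before the data is persisted.
SELECT order_id, customer_id, order_total, updated_at
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY updated_at DESC
        ) AS row_num
    FROM warehouse.orders
) deduped
WHERE row_num = 1;
```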

tip

DataGOL recommends using the Athena query engine when both the source and destination of the pipeline are Amazon S3. For all other pipeline configurations, DataGOL recommends the Spark query engine.